Evaluating the performance of Generative Adversarial Networks (GANs) is an important topic due to its practical significance. Although several evaluation metrics have been proposed, they typically assess the quality of the whole generated image distribution. For the Reference-guided Image Synthesis (RIS) task, i.e., rendering a source image in the style of another reference image, where assessing the quality of a single generated image is crucial, these metrics are not applicable. In this paper, we propose a general learning-based framework, Reference-guided Image Synthesis Assessment (RISA), to quantitatively evaluate the quality of a single generated image. Notably, training RISA does not require human annotations. Specifically, the training data for RISA are acquired from intermediate models during the training of RIS models and are weakly annotated by the number of model iterations, based on the positive correlation between image quality and iterations. Since this annotation is too coarse to serve as a supervision signal, we introduce two techniques: 1) a pixel-wise interpolation scheme to refine the coarse labels, and 2) multiple binary classifiers to replace naive regression. Furthermore, an unsupervised contrastive loss is introduced to effectively capture the style similarity between a generated image and its reference image. Empirical results on various datasets demonstrate that RISA is highly consistent with human preference and transfers well across models.
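For a concrete picture of how multiple binary classifiers can replace naive regression over the coarse iteration-based labels, the following is a minimal PyTorch sketch assuming an ordinal-style formulation; all names, dimensions, and the number of quality levels are illustrative assumptions rather than the authors' implementation.

```python
# Sketch: ordinal-style "multiple binary classifiers" replacing naive regression.
# Hypothetical names and sizes; not the authors' implementation.
import torch
import torch.nn as nn

NUM_LEVELS = 10  # number of weak quality levels (e.g., model-iteration buckets)

def level_to_binary_targets(level: torch.Tensor, num_levels: int = NUM_LEVELS) -> torch.Tensor:
    """Map an integer quality level k to K-1 binary targets: 'is quality > t?' per threshold t."""
    thresholds = torch.arange(num_levels - 1, device=level.device)
    return (level.unsqueeze(1) > thresholds).float()          # shape (B, K-1)

class QualityHead(nn.Module):
    """K-1 binary classifiers on top of a feature extractor."""
    def __init__(self, feat_dim: int = 512, num_levels: int = NUM_LEVELS):
        super().__init__()
        self.fc = nn.Linear(feat_dim, num_levels - 1)

    def forward(self, feats):
        return self.fc(feats)                                  # (B, K-1) logits

head = QualityHead()
criterion = nn.BCEWithLogitsLoss()

feats = torch.randn(4, 512)                    # features of generated images
levels = torch.randint(0, NUM_LEVELS, (4,))    # weak labels from training iteration
logits = head(feats)
loss = criterion(logits, level_to_binary_targets(levels))

# At inference, the predicted quality score is the number of thresholds passed.
score = (torch.sigmoid(logits) > 0.5).sum(dim=1)
```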
Pedestrian detection in the wild remains a challenging problem, especially when the scene contains significant occlusion and/or low resolution of the pedestrians to be detected. Existing methods are unable to adapt to these difficult cases while maintaining acceptable performance. In this paper, we propose a novel feature learning model, referred to as CircleNet, which achieves feature adaptation by mimicking the way humans look at low-resolution and occluded objects: focusing on them again, at a finer scale, if the object cannot be identified clearly the first time. CircleNet is implemented as a set of feature pyramids and uses weight-sharing path augmentation for better feature fusion. It targets reciprocating feature adaptation and iterative object detection using multiple top-down and bottom-up pathways. To take full advantage of the feature adaptation capability of CircleNet, we design an instance decomposition training strategy that focuses on detecting pedestrian instances of various resolutions and different occlusion levels in each cycle. Specifically, CircleNet implements feature ensembling with the idea of hard negative boosting in an end-to-end manner. Experiments on two pedestrian detection datasets, Caltech and CityPersons, show that CircleNet improves the performance on occluded and low-resolution pedestrians by significant margins while maintaining good performance on normal instances.
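To make the idea of reciprocating top-down and bottom-up pathways with weight sharing more tangible, here is a toy PyTorch sketch of shared-weight pyramid fusion; the layer choices, channel sizes, and number of cycles are illustrative assumptions and not CircleNet's actual architecture.

```python
# Toy sketch of reciprocating top-down / bottom-up pyramid fusion with weight sharing.
# Layer names, channel sizes, and cycle count are illustrative, not CircleNet's configuration.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedPathAggregation(nn.Module):
    def __init__(self, channels: int = 256, num_cycles: int = 2):
        super().__init__()
        self.num_cycles = num_cycles
        # Shared convs reused across all pyramid levels and cycles (weight sharing).
        self.top_down_conv = nn.Conv2d(channels, channels, 3, padding=1)
        self.bottom_up_conv = nn.Conv2d(channels, channels, 3, padding=1, stride=2)

    def forward(self, feats):
        # feats: list of pyramid features, highest resolution first.
        for _ in range(self.num_cycles):
            # Top-down pathway: propagate coarse context to finer levels.
            for i in range(len(feats) - 2, -1, -1):
                up = F.interpolate(feats[i + 1], size=feats[i].shape[-2:], mode="nearest")
                feats[i] = self.top_down_conv(feats[i] + up)
            # Bottom-up pathway: push refined details back to coarser levels.
            for i in range(1, len(feats)):
                down = self.bottom_up_conv(feats[i - 1])
                down = F.interpolate(down, size=feats[i].shape[-2:], mode="nearest")
                feats[i] = feats[i] + down
        return feats

pyramid = [torch.randn(1, 256, s, s) for s in (64, 32, 16)]
fused = SharedPathAggregation()(pyramid)
```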
Large pre-trained neural networks are ubiquitous and essential to the success of many downstream tasks in natural language processing and computer vision. However, within the field of web information retrieval there is a stark contrast: a similarly flexible and powerful pre-trained model that can properly parse web pages is lacking. We therefore argue that common machine learning tasks such as content extraction and information mining from web pages carry low-hanging gains that remain unexploited. We aim to close the gap by introducing an agnostic deep graph neural network feature extractor that can ingest web page structures, pre-train itself on massive unlabeled data in a self-supervised fashion, and be fine-tuned for arbitrary tasks on web pages. Finally, we show that our pre-trained model achieves state-of-the-art results on multiple datasets across two very different benchmarks, web page boilerplate removal and genre classification, thus lending support to its potential application in diverse downstream tasks.
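As a rough illustration of the preprocessing such a model needs, the sketch below turns a page's DOM tree into node features and an edge list that a graph neural network could consume; the tag vocabulary and node features are assumptions for illustration, not the paper's design.

```python
# Sketch: turn a web page's DOM tree into node features and an edge list for a GNN.
# The tag vocabulary and per-node features are illustrative choices, not the paper's.
from bs4 import BeautifulSoup

TAG_VOCAB = ["div", "p", "a", "span", "li", "h1", "h2", "img", "table", "other"]

def dom_to_graph(html: str):
    soup = BeautifulSoup(html, "html.parser")
    nodes, edges, features = [], [], []

    def visit(element, parent_idx):
        idx = len(nodes)
        nodes.append(element)
        tag = element.name if element.name in TAG_VOCAB else "other"
        text_len = len(element.get_text(strip=True))
        # Per-node features: integer tag id plus a simple text-length statistic.
        features.append([TAG_VOCAB.index(tag), text_len])
        if parent_idx is not None:
            edges.append((parent_idx, idx))  # parent-child edge in the DOM tree
        for child in element.find_all(recursive=False):
            visit(child, idx)

    visit(soup.find("body") or soup, None)
    return features, edges

feats, edges = dom_to_graph("<body><div><p>Hello</p><a href='#'>link</a></div></body>")
```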
Action understanding has evolved into the era of fine granularity, as most human actions in real life differ only in subtle details. To accurately detect these fine-grained actions in a label-efficient way, we tackle, for the first time, the problem of weakly-supervised fine-grained temporal action detection in videos. Without careful design to capture the subtle differences between fine-grained actions, previous general action detection models do not perform well in the fine-grained setting. We propose to model actions as combinations of reusable atomic actions, which are automatically discovered from the data through self-supervised clustering, in order to capture the commonality and individuality of fine-grained actions. The learned atomic actions, represented as visual concepts, are further mapped to fine and coarse action labels by leveraging the semantic label hierarchy. Our approach constructs a visual representation hierarchy of four levels: clip level, atomic action level, fine action class level, and coarse action class level, with supervision at each level. Extensive experiments on two large-scale fine-grained video datasets, FineAction and FineGym, show the benefit of our proposed weakly-supervised model for fine-grained action detection, achieving state-of-the-art results.
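One way to picture the atomic-action discovery step is clustering clip-level features into a reusable vocabulary; the sketch below does this with scikit-learn's KMeans, where the feature dimension and cluster count are placeholder assumptions rather than the paper's settings.

```python
# Sketch: discover "atomic actions" by clustering clip-level features, then describe
# each video by its sequence of atomic-action assignments. Sizes are placeholders.
import numpy as np
from sklearn.cluster import KMeans

num_atomic_actions = 64
clip_features = np.random.randn(10000, 512)   # features of short clips from many videos

kmeans = KMeans(n_clusters=num_atomic_actions, n_init=10, random_state=0)
kmeans.fit(clip_features)

def clips_to_atomic_sequence(video_clip_feats: np.ndarray) -> np.ndarray:
    """Map each clip of one video to its nearest atomic-action id."""
    return kmeans.predict(video_clip_feats)

video = np.random.randn(32, 512)              # 32 consecutive clips of one video
atomic_ids = clips_to_atomic_sequence(video)  # reusable vocabulary shared across actions
```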
The data imbalance between common and rare diseases during model training often causes intelligent diagnosis systems to produce predictions biased toward common diseases. State-of-the-art methods adopt a two-stage learning framework to alleviate the class imbalance problem, where the first stage focuses on training a general feature extractor and the second stage focuses on fine-tuning the classifier head for class re-balancing. However, existing two-stage approaches do not consider the fine-grained attributes among different diseases, which often makes the first stage less effective for medical image classification than for natural image classification tasks. In this study, we propose embedding metric learning into the first stage of the two-stage framework to help the feature extractor learn more discriminative feature representations. Extensive experiments, mainly on three medical image datasets, demonstrate that the proposed approach consistently outperforms existing one-stage and two-stage methods, indicating that metric learning can serve as a plug-and-play component in the first stage of the two-stage framework for medical image classification tasks with fine-grained class differences.
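A minimal sketch of embedding a metric-learning objective into first-stage training is given below, using a triplet loss alongside cross-entropy; the backbone, mining strategy, and loss weight are assumptions for illustration and may differ from the loss actually used in the paper.

```python
# Sketch: combine cross-entropy with a metric-learning loss in stage one so that the
# feature extractor learns more discriminative representations. Placeholders throughout.
import torch
import torch.nn as nn

backbone = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 256))
classifier = nn.Linear(256, 10)
ce_loss = nn.CrossEntropyLoss()
triplet_loss = nn.TripletMarginLoss(margin=0.3)

images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 10, (8,))
feats = backbone(images)

# Naive in-batch triplet mining: anchor/positive share a label, negative differs.
anchors, positives, negatives = [], [], []
for i in range(len(labels)):
    same = (labels == labels[i]).nonzero().flatten()
    diff = (labels != labels[i]).nonzero().flatten()
    if len(same) > 1 and len(diff) > 0:
        anchors.append(feats[i])
        positives.append(feats[same[same != i][0]])
        negatives.append(feats[diff[0]])

loss = ce_loss(classifier(feats), labels)
if anchors:
    loss = loss + 0.5 * triplet_loss(torch.stack(anchors),
                                     torch.stack(positives),
                                     torch.stack(negatives))
loss.backward()
```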
Actions in videos often involve interactions between humans and objects. Action labels are usually composed of various combinations of verbs and nouns, but we may not have training data for all possible combinations. In this paper, we aim to improve the generalization ability of compositional action recognition models to novel verbs or novel nouns unseen during training by leveraging the power of knowledge graphs. Previous work utilizes verb-noun compositional action nodes in the knowledge graph and is therefore inefficient, since the number of compositional action nodes grows quadratically with the number of verbs and nouns. To address this issue, we propose our approach: Disentangling Action Recognition with Knowledge bases (DARK), which leverages the inherent compositionality of actions. DARK trains a factorized model that first extracts disentangled feature representations of verbs and nouns, and then predicts classification weights using relations in an external knowledge graph. The type constraints between verbs and nouns are extracted from the external knowledge base and finally applied when composing actions. DARK has better scalability in the number of objects and verbs, and achieves state-of-the-art performance on the Charades dataset. We further propose a new benchmark split based on the Epic-Kitchens dataset, which is larger in the number of categories and samples, and benchmark various models on it.
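The factorization can be illustrated with a toy sketch in which separate verb and noun heads are composed under a compatibility mask standing in for the knowledge-base type constraints; all sizes and the mask itself are illustrative assumptions, not the authors' implementation.

```python
# Toy sketch of factorized verb/noun recognition: separate heads score verbs and nouns,
# and a verb-noun compatibility mask (standing in for knowledge-base type constraints)
# zeroes out implausible compositions. Not the authors' implementation.
import torch
import torch.nn as nn

num_verbs, num_nouns, feat_dim = 20, 50, 256

verb_head = nn.Linear(feat_dim, num_verbs)
noun_head = nn.Linear(feat_dim, num_nouns)

# Type constraints: compatibility[v, n] = 1 if verb v can apply to noun n.
compatibility = torch.ones(num_verbs, num_nouns)
compatibility[3, :10] = 0  # e.g., some verb cannot apply to the first ten nouns

video_feat = torch.randn(1, feat_dim)
verb_scores = verb_head(video_feat).softmax(dim=-1)   # (1, V)
noun_scores = noun_head(video_feat).softmax(dim=-1)   # (1, N)

# Compose action scores from the disentangled predictions: two heads of size V and N
# instead of a single classifier over all V*N compositions.
action_scores = verb_scores.unsqueeze(2) * noun_scores.unsqueeze(1) * compatibility
best = action_scores.flatten(1).argmax(dim=1)
verb_id, noun_id = best // num_nouns, best % num_nouns
```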
Knowledge graphs (KGs) store a large amount of structured knowledge that is not easy for humans to understand directly. Knowledge graph-to-text (KG-to-text) generation aims to generate easy-to-understand sentences from a KG while maintaining semantic consistency between the generated sentences and the KG. Existing KG-to-text generation methods phrase this task as a sequence-to-sequence generation task with a linearized KG as input, and address the consistency between the generated text and the KG by simply selecting between each decoded sentence word and the KG node words at each time step. However, the linearized KG order is usually obtained by heuristic search without data-driven optimization. In this paper, we optimize knowledge description order prediction under the order supervision extracted from the caption, and further enhance the consistency between the generated sentences and the KG through syntactic and semantic regularization. We incorporate part-of-speech (POS) syntactic tags to constrain the positions at which words are copied from the KG, and employ a semantic context scoring function to evaluate the semantic fitness of each word in its local context when generating each word of the sentence. Extensive experiments are conducted on two datasets, WebNLG and DART, and achieve state-of-the-art performance.
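To make the POS-based copy constraint concrete, the toy sketch below gates a copy distribution over KG words so that copying only happens at decoding positions expected to be nouns; the vocabulary, tagging, and mixing gate are placeholder assumptions, not the paper's model.

```python
# Toy sketch: gate the copy distribution with a POS constraint, so words are only copied
# from the KG at decoding positions tagged as noun positions. Placeholders throughout.
import torch

kg_words = ["Alan_Turing", "London", "born_in"]
kg_is_noun = torch.tensor([1.0, 1.0, 0.0])     # simple per-entity POS flag

# Decoder state at one step: generation logits over the vocab and copy logits over KG words.
gen_logits = torch.randn(1, 1000)
copy_logits = torch.randn(1, len(kg_words))
predicted_pos_is_noun = True                   # POS predicted for the next target position

# Copying is only allowed when the target position expects a noun AND the KG word is a noun.
mask = kg_is_noun if predicted_pos_is_noun else torch.zeros_like(kg_is_noun)
copy_probs = torch.softmax(copy_logits, dim=-1) * mask
gen_probs = torch.softmax(gen_logits, dim=-1)

p_copy = 0.3                                   # mixing gate (normally predicted by the model)
final_probs = torch.cat([(1 - p_copy) * gen_probs, p_copy * copy_probs], dim=-1)
next_token = final_probs.argmax(dim=-1)
```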
This paper focuses on designing efficient models with low parameters and FLOPs for dense predictions. Even though CNN-based lightweight methods have achieved stunning results after years of research, the trade-off between model accuracy and constrained resources still needs further improvement. This work rethinks the essential unity of the efficient Inverted Residual Block in MobileNetv2 and the effective Transformer in ViT, inductively abstracting a general concept of the Meta-Mobile Block, and we argue that the specific instantiation is very important to model performance even when the same framework is shared. Motivated by this phenomenon, we deduce a simple yet efficient modern \textbf{I}nverted \textbf{R}esidual \textbf{M}obile \textbf{B}lock (iRMB) for mobile applications, which absorbs CNN-like efficiency to model short-distance dependency and Transformer-like dynamic modeling capability to learn long-distance interactions. Furthermore, we design a ResNet-like 4-phase \textbf{E}fficient \textbf{MO}del (EMO) based only on a series of iRMBs for dense applications. Extensive experiments on the ImageNet-1K, COCO2017, and ADE20K benchmarks demonstrate the superiority of our EMO over state-of-the-art methods, \eg, our EMO-1M/2M/5M achieve 71.5, 75.1, and 78.4 Top-1 accuracy, surpassing \textbf{SoTA} CNN-/Transformer-based models while trading off model accuracy and efficiency well.
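A simplified sketch of an inverted-residual block that mixes a depthwise convolution (local modeling) with multi-head self-attention (long-range modeling) is shown below; the expansion ratio, ordering, and normalization are illustrative assumptions rather than the exact iRMB definition.

```python
# Simplified sketch of an inverted-residual block mixing a depthwise conv (local, CNN-like)
# with multi-head self-attention (long-range, Transformer-like). Not the exact iRMB.
import torch
import torch.nn as nn

class InvertedResidualAttnBlock(nn.Module):
    def __init__(self, dim: int = 64, expand: int = 4, heads: int = 4):
        super().__init__()
        hidden = dim * expand
        self.expand = nn.Conv2d(dim, hidden, 1)                                  # 1x1 expansion
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)     # depthwise, local
        self.attn = nn.MultiheadAttention(hidden, heads, batch_first=True)       # global mixing
        self.project = nn.Conv2d(hidden, dim, 1)                                 # 1x1 projection back
        self.norm = nn.BatchNorm2d(dim)

    def forward(self, x):
        b, c, h, w = x.shape
        y = self.expand(x)
        y = self.dwconv(y)
        # Attention over the flattened spatial positions.
        tokens = y.flatten(2).transpose(1, 2)                 # (B, H*W, hidden)
        tokens, _ = self.attn(tokens, tokens, tokens)
        y = y + tokens.transpose(1, 2).reshape(b, -1, h, w)
        return self.norm(x + self.project(y))                 # residual connection

block = InvertedResidualAttnBlock()
out = block(torch.randn(2, 64, 14, 14))
```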
Supervised Question Answering systems (QA systems) rely on domain-specific human-labeled data for training. Unsupervised QA systems generate their own question-answer training pairs, typically using secondary knowledge sources to achieve this outcome. Our approach (called PIE-QG) uses Open Information Extraction (OpenIE) to generate synthetic training questions from paraphrased passages and uses the resulting question-answer pairs as training data for a BERT-based language model in a state-of-the-art QA system. Triples in the form of <subject, predicate, object> are extracted from each passage, and questions are formed from subjects (or objects) and predicates while the objects (or subjects) serve as answers. Experiments on five extractive QA datasets demonstrate that our technique achieves performance on par with existing state-of-the-art QA systems, with the benefit of being trained on an order of magnitude fewer documents and without any recourse to external reference data sources.
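The triple-to-question step can be illustrated with a toy sketch that turns <subject, predicate, object> triples into synthetic QA pairs; the templates and example triples are assumptions for illustration, whereas in PIE-QG the triples come from OpenIE over paraphrased passages.

```python
# Toy sketch: turn <subject, predicate, object> triples into synthetic QA pairs.
# Triples here are hand-written; question templates are illustrative assumptions.
from typing import List, Tuple

def triples_to_qa(triples: List[Tuple[str, str, str]]):
    qa_pairs = []
    for subj, pred, obj in triples:
        # The object becomes the answer when asking about the subject, and vice versa.
        qa_pairs.append((f"What {pred} {obj}?", subj))
        qa_pairs.append((f"{subj} {pred} what?", obj))
    return qa_pairs

triples = [("Marie Curie", "discovered", "polonium"),
           ("The Eiffel Tower", "is located in", "Paris")]
for question, answer in triples_to_qa(triples):
    print(question, "->", answer)
```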
Transformers have achieved impressive successes on various computer vision tasks. However, most existing studies require pretraining the Transformer backbone on a large-scale labeled dataset (e.g., ImageNet) to achieve satisfactory performance, which is usually unavailable for medical images. Additionally, due to the gap between medical and natural images, the improvement brought by ImageNet pretrained weights degrades significantly when transferring the weights to medical image processing tasks. In this paper, we propose Bootstrap Own Latent of Transformer (BOLT), a self-supervised learning approach designed specifically for medical image classification with a Transformer backbone. Our BOLT consists of two networks, namely online and target branches, for self-supervised representation learning. Concretely, the online network is trained to predict the target network's representation of the same patch embedding tokens under a different perturbation. To maximally excavate the capability of the Transformer from limited medical data, we propose an auxiliary difficulty ranking task: the Transformer is required to identify which branch (i.e., online or target) is processing the more difficult perturbed tokens. Overall, the Transformer endeavours to distill transformation-invariant features from the perturbed tokens, simultaneously achieving difficulty measurement and maintaining the consistency of self-supervised representations. The proposed BOLT is evaluated on three medical image processing tasks, i.e., skin lesion classification, knee fatigue fracture grading, and diabetic retinopathy grading. The experimental results validate the superiority of our BOLT for medical image classification, compared to ImageNet pretrained weights and state-of-the-art self-supervised learning approaches.
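A condensed sketch of the two-branch objective plus the auxiliary difficulty-ranking idea is given below; the encoders, perturbations, and loss weighting are placeholder assumptions rather than the paper's exact design.

```python
# Condensed sketch: an online branch predicts the target branch's representation of the
# same tokens under a different perturbation (BYOL-style), while an auxiliary head
# classifies which branch received the harder perturbation. Placeholders throughout.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 128
online_encoder = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
target_encoder = copy.deepcopy(online_encoder)          # updated by EMA, no gradients
for p in target_encoder.parameters():
    p.requires_grad = False
predictor = nn.Linear(dim, dim)
difficulty_head = nn.Linear(dim, 2)                     # which branch saw the harder tokens?

tokens = torch.randn(8, dim)                            # patch-embedding tokens
weak = tokens + 0.05 * torch.randn_like(tokens)         # mild perturbation
strong = tokens + 0.50 * torch.randn_like(tokens)       # harder perturbation

online_out = predictor(online_encoder(strong))
with torch.no_grad():
    target_out = target_encoder(weak)

# Consistency: predict the target representation from the online branch.
consistency = F.mse_loss(F.normalize(online_out, dim=-1), F.normalize(target_out, dim=-1))

# Auxiliary difficulty ranking: the online branch processed the harder perturbation (label 1).
difficulty_logits = difficulty_head(online_encoder(strong))
ranking = F.cross_entropy(difficulty_logits, torch.ones(8, dtype=torch.long))

loss = consistency + 0.1 * ranking
loss.backward()
```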